This comprehensive tutorial demonstrates the implementation of RNA velocity analysis using scVelo, a state-of-the-art Python package for single-cell RNA sequencing (scRNA-seq) data analysis. Published in Nature Biotechnology (2020), scVelo represents a significant advancement in transcriptional dynamics analysis, building upon and enhancing the foundational RNA velocity framework while addressing limitations of its predecessor, velocyto.
RNA velocity analysis employs sophisticated mathematical modeling of transcriptional kinetics to unveil cellular dynamics that traditional scRNA-seq approaches cannot directly capture. This powerful methodology enables researchers to: 1. Determine the transcriptional state of genes (induction or repression) within specific cell populations 2. Predict cellular trajectories and fate decisions through pseudotime analysis 3. Reconstruct developmental hierarchies with unprecedented temporal resolution
By leveraging the analytical capabilities of scVelo, we can gain deep insights into cellular differentiation processes, developmental trajectories, and dynamic gene regulation in complex biological systems.
This tutorial integrates best practices from the official scVelo documentation while incorporating advanced analytical approaches for comprehensive single-cell analysis.
library(CellChat)
library(Seurat)
library(reticulate)
library(tidyverse)
library(svglite)
library(patchwork)
library(ggalluvial)
library(NMF)
set.seed(12345)
options(stringsAsFactors = FALSE)
reticulate::use_python("C:/Users/jwang/AppData/Local/Programs/Python/Python313/python.exe", required=T)
# setwd("D:/Data_Analysis_Practice_2025/scRNAseq/scRNAseq_Analysis_templates")Our analysis begins with a curated mouse brain dataset encompassing both disease and control samples. The initial data processing pipeline, executed in R using the Seurat framework, included: - Quality control and cell filtering - Data normalization and scaling - Dimensionality reduction - Batch effect correction - Unsupervised clustering analysis
A critical step in our workflow involves converting the processed Seurat object to AnnData format, the primary data structure utilized by scVelo. While the SeuratDisk package offers direct conversion capabilities, we implement a more robust custom conversion pipeline to ensure data integrity and compatibility. This approach provides greater control over the transfer of key analytical components between the two frameworks.
# save metadata table:
seurat_obj$barcode <- colnames(seurat_obj)
seurat_obj$UMAP_1 <- seurat_obj@reductions$umap@cell.embeddings[,1]
seurat_obj$UMAP_2 <- seurat_obj@reductions$umap@cell.embeddings[,2]
write.csv(seurat_obj@meta.data, file='metadata.csv', quote=F, row.names=F)
# write expression counts matrix
library(Matrix)
counts_matrix <- GetAssayData(seurat_obj, assay='RNA', slot='counts')
writeMM(counts_matrix, file=paste0(out_data_dir, 'counts.mtx'))
# write dimesnionality reduction matrix, in this example case pca matrix
write.csv(seurat_obj@reductions$pca@cell.embeddings, file='pca.csv', quote=F, row.names=F)
# write gene names
write.table(
data.frame('gene'=rownames(counts_matrix)),file='gene_names.csv',
quote=F,row.names=F,col.names=F
)import scanpy as sc
import anndata
from scipy import io
from scipy.sparse import coo_matrix, csr_matrix
import numpy as np
import os
import pandas as pd
# load sparse matrix:
X = io.mmread("counts.mtx")
# create anndata object
adata = anndata.AnnData(
X=X.transpose().tocsr()
)
# load cell metadata:
cell_meta = pd.read_csv("count/metadata.csv")
# load gene names:
with open("gene_names.csv", 'r') as f:
gene_names = f.read().splitlines()
# set anndata observations and index obs by barcodes, var by gene names
adata.obs = cell_meta
adata.obs.index = adata.obs['count/barcode']
adata.var.index = gene_names
# load dimensional reduction:
pca = pd.read_csv("count/pca.csv")
pca.index = adata.obs.index
# set pca and umap
adata.obsm['X_pca'] = pca.to_numpy()
adata.obsm['X_umap'] = np.vstack((adata.obs['UMAP_1'].to_numpy(), adata.obs['UMAP_2'].to_numpy())).T
# plot a UMAP colored by sampleID to test:
sc.pl.umap(adata, color=['SampleID'], frameon=False, save=True)
# save dataset as anndata format
adata.write('my_data.h5ad')
# reload dataset
adata = sc.read_h5ad('my_data.h5ad')RNA velocity analysis requires the distinction between spliced and unspliced transcripts, necessitating a specialized quantification approach beyond traditional UMI-based count matrices. This section details the construction of these distinct matrices using the velocyto command-line interface (CLI).
While both velocyto and Kallisto-Bustools are capable of generating these specialized matrices, we implement velocyto due to its: 1. Robust integration with Cell Ranger outputs 2. Flexible compatibility with diverse single-cell platforms 3. Established track record in RNA velocity analysis
The velocyto CLI offers seamless processing of: - Direct Cell Ranger output directories - Generic BAM files from any single-cell platform - Species-specific annotations (utilizing mm10 reference in this implementation) - Optional repeat masking for enhanced accuracy (recommended for optimal results)
This approach ensures accurate quantification of splicing states, crucial for downstream velocity calculations.
#!/bin/bash
#SBATCH --job-name=velocyto_run
#SBATCH --nodes=1
#SBATCH --ntasks=10
#SBATCH --cpus-per-task=4
#SBATCH --mem=240G
#SBATCH --time=24:00:00
#SBATCH --output=./SLURM_OUT/%x_%A_%a.out
#SBATCH --error=./SLURM_OUT/%x_%A_%a.err
transcriptome="/refdata-gex-GRCh38-2024-A/genes/genes.gtf"
BARCODES="/filtered_feature_bc_matrix/barcodes.tsv"
BAMFILE="/possorted_genome_bam.bam"
velocyto run -b $BARCODES -o ./velocyto $BAMFILE $transcriptomeNow that we have our input data properly formatted, we can load it into python. Velocyto created a separate spliced and unspliced matrix for each sample, so we first have to merge the different samples into one object. Additionally, I am reformatting the cell barcodes to match my anndata object with the full genes-by-cells data.
import scvelo as scv
import scanpy as sc
import cellrank as cr
import numpy as np
import pandas as pd
import anndata as adscv.settings.verbosity = 3
scv.settings.set_figure_params('scvelo', facecolor='white', dpi=100, frameon=False)
cr.settings.verbosity = 2
adata = sc.read_h5ad('count/my_data.h5ad')ldata_control_1 = sc.read('count/possorted_genome_bam_control_1.loom', cache=True)
ldata_control_2 = sc.read('count/possorted_genome_bam_control_2.loom', cache=True)
ldata_control_3 = sc.read('count/possorted_genome_bam_control_3.loom', cache=True)
ldata_control_4 = sc.read('count/possorted_genome_bam_control_4.loom', cache=True)
ldata_case_1 = sc.read('count/possorted_genome_bam_case_1.loom', cache=True)
ldata_case_2 = sc.read('count/possorted_genome_bam_case_2.loom', cache=True)
ldata_case_3 = sc.read('count/possorted_genome_bam_case_3.loom', cache=True)
ldata_case_4 = sc.read('count/possorted_genome_bam_case_4.loom', cache=True)
# rename barcodes in order to merge
barcodes = [bc.split(':')[1] for bc in ldata_control_1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_1' for bc in barcodes]
ldata_control_1.obs.index = barcodes
barcodes = [bc.split(':')[1] for bc in ldata_control_2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_3' for bc in barcodes]
ldata_control_2.obs.index = barcodes
barcodes = [bc.split(':')[1] for bc in ldata_control_3.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_4' for bc in barcodes]
ldata_control_3.obs.index = barcodes
barcodes = [bc.split(':')[1] for bc in ldata_control_4.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_7' for bc in barcodes]
ldata_control_4.obs.index = barcodes
barcodes = [bc.split(':')[1] for bc in ldata_case_1.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_2' for bc in barcodes]
ldata_case_1.obs.index = barcodes
barcodes = [bc.split(':')[1] for bc in ldata_case_2.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_5' for bc in barcodes]
ldata_case_2.obs.index = barcodes
barcodes = [bc.split(':')[1] for bc in ldata_case_3.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_6' for bc in barcodes]
ldata_case_3.obs.index = barcodes
barcodes = [bc.split(':')[1] for bc in ldata_case_4.obs.index.tolist()]
barcodes = [bc[0:len(bc)-1] + '_8' for bc in barcodes]
ldata_case_4.obs.index = barcodes
# make variable names unique
ldata_control_1.var_names_make_unique()
ldata_control_2.var_names_make_unique()
ldata_control_3.var_names_make_unique()
ldata_control_4.var_names_make_unique()
ldata_case_1.var_names_make_unique()
ldata_case_2.var_names_make_unique()
ldata_case_3.var_names_make_unique()
ldata_case_4.var_names_make_unique()
# concatenate the eight loom files
ldata = ldata_control_1.concatenate([ldata_control_2, ldata_control_3, ldata_control_4, ldata_case_1, ldata_case_2, ldata_case_3, ldata_case_4])
# merge matrices into the original adata object
adata = scv.utils.merge(adata, ldata)
adata.write('count/merged_data.h5ad')
adata = sc.read_h5ad('count/merged_data.h5ad')
print(adata)With our data properly structured, we proceed to the core analytical phase: RNA velocity computation using scVelo’s sophisticated modeling framework. The package offers two distinct mathematical approaches:
Our analysis pipeline begins with a thorough inspection of transcriptional states across cell clusters, followed by rigorous preprocessing steps. Initially, we implement the steady-state model with stochastic optimization to establish baseline velocity estimates.
Figure 2. proportions
The complex dynamics of cellular transcriptional states can be elegantly visualized through RNA velocity vector fields. By projecting velocity vectors onto two-dimensional embeddings (typically UMAP or t-SNE), we can observe: - Directional trends in gene expression changes - Potential developmental trajectories - Transition states between cell populations - Local and global patterns of transcriptional dynamics
These visualizations integrate velocity information across all genes and cells, providing a comprehensive view of cellular dynamics in the reduced dimensional space.
Figure 3. scvelo_embedding
scv.pl.velocity_embedding_grid(adata, basis='umap', color='clusters', save='embedding_grid.pdf', title='', scale=0.25)Figure 4. scvelo_embedding_grid
scv.pl.velocity_embedding_stream(adata, basis='umap', color=['clusters'], save='embedding_stream.pdf', title='')Figure 5. scvelo_embedding_stream
The RNA velocity framework enables sophisticated downstream analyses to extract biological insights from transcriptional dynamics. Our analytical pipeline encompasses:
These analyses collectively provide a comprehensive understanding of cellular dynamics and developmental trajectories.
scv.tl.rank_velocity_genes(adata, groupby='clusters', min_corr=.3)
# scv.DataFrame was removed in the past. Please use pandas.DataFrame, instead.
df = pd.DataFrame(adata.uns['rank_velocity_genes']['names'])
df.head()
df.to_csv('count/rank_velocity_pancrea_genes.csv', index=False)
df = pd.read_csv('count/rank_velocity_pancrea_genes.csv')
kwargs = dict(frameon=False, size=10, linewidth=1.5,
add_outline='Ngn3 low EP, Ngn3 high EP')
scv.pl.scatter(adata, df['Ngn3 low EP'][:3], ylabel='Ngn3 low EP', frameon=False, color='clusters', size=10, linewidth=1.5, save="Ngn3 low EP_scatter.pdf")Figure 7. excitatory_neurons_scatter
scv.pl.scatter(adata, df['Ngn3 high EP'][:3], ylabel='Ngn3 high EP', frameon=False, color='clusters', size=10, linewidth=1.5, save="Ngn3 high EP_scatter.pdf")Figure 8. scvelo_radial_glia_scatter
scv.tl.velocity_confidence(adata)
keys = 'velocity_length', 'velocity_confidence'
scv.pl.scatter(adata, c=keys, cmap='coolwarm', perc=[5, 95], save="velocity_confidence.pdf")Figure 9. scvelo_velocity_confidence
Figure 10. scvelo_velocity_graph
x, y = scv.utils.get_cell_transitions(adata, basis='umap', starting_cell=70)
ax = scv.pl.velocity_graph(adata, c='lightgrey', edge_width=.05, show=False, save="velocity_graph_lightgrey.pdf")
ax = scv.pl.scatter(adata, x=x, y=y, s=120, c='ascending', cmap='gnuplot', ax=ax, save="scv_pl_scatter.pdf")scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot', save="velocity_pseudotime.pdf")Figure 11. scvelo_velocity_pseudotime
adata.uns['neighbors']['distances'] = adata.obsp['distances']
adata.uns['neighbors']['connectivities'] = adata.obsp['connectivities']# Assume `adata` has been preprocessed (e.g., normalized, scaled, PCA)
sc.pp.neighbors(adata)
sc.tl.leiden(adata, key_added='clusters')
scv.tl.paga(adata, groups='clusters_colors')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
#df.style.background_gradient(cmap='Blues').format('{:.2g}')
scv.pl.paga(adata, basis='umap', size=50, alpha=.1,
min_edge_width=2, node_size_scale=1.5)When analyzing complex tissues with multiple distinct cellular lineages, a targeted approach focusing on specific cell populations can reveal nuanced biological insights that might be obscured in global analysis. In this section, we demonstrate:
This refined approach enables deeper insights into lineage-specific developmental programs and cellular state transitions.
cur_celltypes = ['Ductal', 'Ngn3 low EP', 'Ngn3 high EP', 'Pre-endocrine', 'Beta', 'Alpha', 'Delta', 'Epsilon']
adata_subset = adata[adata.obs['clusters_coarse'].isin(cur_celltypes)]
sc.pl.umap(adata_subset, color=['clusters'], frameon=False, title=['', ''], save="velocity_pl_umap.pdf")
sc.pp.neighbors(adata_subset, n_neighbors=15, use_rep='X_pca')Figure 12. umapvelocity_pl_umap
scv.pp.filter_and_normalize(adata_subset)
scv.pp.moments(adata_subset)
scv.tl.recover_dynamics(adata_subset)
scv.tl.velocity(adata_subset, mode='dynamical')
scv.tl.velocity_graph(adata_subset)
scv.pl.velocity_embedding_stream(adata_subset, basis='umap', color=['clusters'], save='embedding_stream.pdf', title='')Figure 13. scvelo_embedding_stream
df = adata_subset.var
df = df[(df['fit_likelihood'] > .1) & df['velocity_genes'] == True]
kwargs = dict(xscale='log', fontsize=16)
with scv.GridSpec(ncols=3) as pl:
pl.hist(df['fit_alpha'], xlabel='transcription rate', **kwargs)
pl.hist(df['fit_beta'] * df['fit_scaling'], xlabel='splicing rate', xticks=[.1, .4, 1], **kwargs)
pl.hist(df['fit_gamma'], xlabel='degradation rate', xticks=[.1, .4, 1], **kwargs)
scv.get_df(adata_subset, 'fit*', dropna=True).head()
scv.tl.latent_time(adata_subset)
scv.pl.scatter(adata_subset, color='latent_time', color_map='gnuplot', size=80, save='latent_time.pdf')Figure 14. scvelo_latent_time
top_genes = adata_subset.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(adata_subset, var_names=top_genes, sortby='latent_time', col_color='clusters', n_convolve=100, save='heatmap.pdf')Figure 15. scvelo_heatmap
top_genes = adata_subset.var['fit_likelihood'].sort_values(ascending=False).index
scv.pl.scatter(adata_subset, color='clusters', basis=top_genes[:15], ncols=5, frameon=False, save='scatter.pdf')Figure 16. scvelo_scatter
var_names = ['Xkr4', 'Gm37381', 'Rp1']
scv.pl.scatter(adata_subset, var_names, color='clusters', frameon=False, save='XGR_scatter_1.pdf')Figure 17. scvelo_XGR_scatter_1
scv.pl.scatter(adata_subset, x='latent_time', y=var_names, color='clusters', frameon=False, save='XGR_scatter_2.pdf')Figure 18. scvelo_XGR_scatter_2
This comprehensive tutorial has demonstrated the power and versatility of RNA velocity analysis in single-cell transcriptomics using scVelo. We have explored:
As single-cell technologies continue to evolve, RNA velocity analysis stands as a fundamental tool for understanding cellular dynamics and developmental processes. The integration of these approaches with emerging computational methods promises to further enhance our understanding of complex biological systems.
#> R version 4.5.0 (2025-04-11 ucrt)
#> Platform: x86_64-w64-mingw32/x64
#> Running under: Windows 11 x64 (build 26100)
#>
#> Matrix products: default
#> LAPACK version 3.12.1
#>
#> locale:
#> [1] LC_COLLATE=English_United States.utf8 LC_CTYPE=English_United States.utf8 LC_MONETARY=English_United States.utf8
#> [4] LC_NUMERIC=C LC_TIME=English_United States.utf8
#>
#> time zone: America/New_York
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] htmltools_0.5.8.1 kableExtra_1.4.0 knitr_1.50
#>
#> loaded via a namespace (and not attached):
#> [1] svglite_2.2.1 cli_3.6.5 rlang_1.1.6 xfun_0.53 stringi_1.8.7 showtextdb_3.0
#> [7] sysfonts_0.8.9 png_0.1-8 textshaping_1.0.3 jsonlite_2.0.0 glue_1.8.0 sass_0.4.10
#> [13] scales_1.4.0 rmarkdown_2.30 grid_4.5.0 evaluate_1.0.5 jquerylib_0.1.4 fastmap_1.2.0
#> [19] yaml_2.3.10 lifecycle_1.0.4 stringr_1.5.2 compiler_4.5.0 RColorBrewer_1.1-3 Rcpp_1.1.0
#> [25] rstudioapi_0.17.1 lattice_0.22-7 systemfonts_1.3.1 farver_2.1.2 digest_0.6.37 viridisLite_0.4.2
#> [31] R6_2.6.1 reticulate_1.43.0 dichromat_2.0-0.1 showtext_0.9-7 magrittr_2.0.4 Matrix_1.7-4
#> [37] bslib_0.9.0 tools_4.5.0 xml2_1.4.0 cachem_1.1.0